Welcome everybody to today's deep learning lecture. Today we want to talk a bit about common practices, the things you need to know to get everything implemented in practice. So here is a small outline of the next couple of videos and the topics we will look at. We will think about the problems that we currently have and how far we have come. Then we talk about training strategies, in particular optimization and the learning rate, and a couple of tricks for how to adjust them. Next come architecture selection and hyperparameter optimization. One trick that is really useful is ensembling, and typically people also have to deal with class imbalance, for which there are some very interesting approaches. Finally, we look into evaluation and how to get a good estimate of how well our network is actually performing. So far we have seen all the nuts and bolts of how to train a network.
We have the fully connected and convolutional layers, the activation functions, the loss functions, optimization, and regularization, and today we will talk about how to choose the architecture, and how to train and evaluate a deep neural network. The very first thing is test data. Test data goes into the vault: ideally, the test set should be kept in a vault and only be brought out at the end of the data analysis, as Hastie and colleagues teach in The Elements of Statistical Learning. First things first, overfitting is extremely easy with neural networks; remember the ImageNet experiments with random labels. The true test set error and the generalization can be underestimated substantially when you use the test set for model selection. Not a good idea. Choosing the architecture is typically the first element of model selection, and this should never be done on the test set. You can do initial experimentation on a smaller subset of the data to figure out what works, but never work on the test set when you are selecting the architecture.
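As a minimal sketch of this discipline, assuming scikit-learn's train_test_split and placeholder arrays X and y: split the test set off once, put it in the vault, and do all further experimentation on the remaining development data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data; X and y stand in for your real dataset.
X = np.random.randn(1000, 32)
y = np.random.randint(0, 10, size=1000)

# Split the test set off once and "put it in the vault":
# it is only touched for the final evaluation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# All model selection (architecture, hyperparameters) happens on a
# further train/validation split of the development data only.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)
```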
Okay, let's look at a couple of training strategies. Before the training, check your gradients, check the loss function, and check that your own layer implementations compute correctly. If you implemented your own layer, compare the analytic and the numerical gradient: use central differences for the numerical gradient, use relative errors instead of absolute differences, and consider the numerics. Use double precision for checking, temporarily scale the loss function if you observe very small values, and choose the step size h appropriately.
Then we have a couple of additional recommendations. If you only use a few data points, you will have fewer issues with non-differentiable parts of the loss function. You can train the network for a short period of time and only then perform the gradient checks. Check the gradient first without the regularization terms, then with them: so first turn the regularization terms off, check the gradient, and then check again with the regularization terms switched on. Also turn off data augmentation and dropout during the checks. Typically, you make these checks on rather small data sets.
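In a framework like PyTorch, the same protocol can be sketched with the built-in torch.autograd.gradcheck; the linear layer below is just a stand-in for your own implementation.

```python
import torch

# Placeholder layer standing in for your own implementation. Use double
# precision, and eval() mode so that dropout-style layers are disabled.
layer = torch.nn.Linear(4, 3).double()
layer.eval()

x = torch.randn(2, 4, dtype=torch.double, requires_grad=True)

# torch.autograd.gradcheck compares the analytic gradient from backward()
# against central differences. Run it once on the plain data loss; then
# repeat with your regularization term added to the loss, and keep data
# augmentation switched off while checking.
ok = torch.autograd.gradcheck(lambda inp: layer(inp).sum(), (x,), eps=1e-6)
print("gradient check passed:", ok)
```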
Next, the goal of the initialization check is to verify that you have a correct random initialization of the layers. You can compute the loss for each class on the untrained network with regularization turned off, and of course an untrained network should give a random classification. You can then compare this loss with the loss achieved when deciding for a class uniformly at random, and they should be about the same, because you randomly initialized. Then you repeat this with multiple random initializations, just to check that there is nothing wrong with the initialization.
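For a C-class problem with softmax cross-entropy, deciding for a class uniformly at random corresponds to probabilities 1/C, so the expected initial loss is about -ln(1/C) = ln(C); for C = 10 that is ln 10 ≈ 2.303. A quick sketch of this sanity check, with a placeholder model and batch:

```python
import math
import torch
import torch.nn as nn

num_classes = 10
# Placeholder untrained classifier with a standard random initialization.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, num_classes))

x = torch.randn(256, 32)                   # placeholder batch
y = torch.randint(0, num_classes, (256,))  # arbitrary labels

with torch.no_grad():
    loss = nn.functional.cross_entropy(model(x), y)

expected = math.log(num_classes)           # -ln(1/C) = ln(C) ≈ 2.303 for C = 10
print(f"initial loss {loss.item():.3f} vs expected {expected:.3f}")
# Repeat with several random initializations; all values should stay
# close to ln(C) if nothing is wrong (no regularization term included here).
```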
Now let's go to the training. First, you check whether the architecture is in general capable of learning the task. Before training the network on the full data set, take a small subset of the data, maybe 5 to 20 samples, and try to overfit the network to get a zero loss. With so few samples, you should be able to memorize the entire subset. If you can really go down to zero loss, you know that your training procedure actually works. Optionally, you can turn off the regularization, because it may hinder this overfitting procedure.
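A sketch of this overfitting test, assuming a placeholder model and a tiny batch of 10 samples; Adam is used with its default weight_decay of zero, so no regularization interferes with the memorization.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice, use the architecture you want to test.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

x = torch.randn(10, 32)              # tiny subset: roughly 5 to 20 samples
y = torch.randint(0, 10, (10,))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # weight_decay defaults to 0
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# With so few samples the network should memorize the batch and the loss
# should go to (almost) zero; if it does not, suspect a bug or too little capacity.
print(f"final loss: {loss.item():.6f}")
```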
If the network can't overfit, you may have a bug in the implementation, your model may be too small, so you may want to increase the number of parameters or the model capacity, or the model may simply not be suitable for this task. Also, get a first idea about how the data...